Data understanding¶

Data description: The data is collected from an engineering facility. It contains indicators related to the ESH (Energy Savings and Health) targets on a monthly basis. The data dates back to January 2020 for some indicators and was last updated in September 2024.

Below is the list of indicators.

  • total energy consumption (kWh)
  • total CO2 emission (kg)
  • total water consumption (m3)
  • max electric usage demand (kW)
  • amount light oil used (kg)
  • general waste
    • t waste total (kg)
      • non hazardous waste for recycling (kg)
        • recycling rate
      • non-hazardous waste for landfill/elimination (kg)
  • industrial waste
    • t waste total (kg)
      • non-hazardous waste for recycling (kg)
        • non-hazardous recycling rate
      • non-hazardous waste for landfill/elimination (kg)
      • hazardous waste for recycling (kg)
        • hazardous recycling rate
      • hazardous waste for landfill/elimination (kg)
    • wood chips
    • cardboard

Importing libraries and data¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
from sklearn.metrics import root_mean_squared_error

To expedite the analysis, I restructured the data in Excel, concatenating the 4 files along the horizontal axis into a single data set, cleaned_data.xlsx

In [2]:
df = pd.read_excel('cleaned_data.xlsx')
df = df.melt(id_vars = ['ESH Aspects or Topic', 'Objectives', 'Indicator'], var_name = 'month')

df['month'] = pd.to_datetime(df['month'], format='%Y-%b')
df = df.dropna()
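As a sanity check on the reshaping step, here is how melt behaves on a tiny wide-format frame with the same layout (the values and month columns here are made up for illustration):

```python
import pandas as pd

# Toy wide-format frame mimicking the structure of cleaned_data.xlsx:
# one row per indicator, one column per month
wide = pd.DataFrame({
    'ESH Aspects or Topic': ['Use of energy (direct / indirect)'],
    'Objectives': ['Reduction of total energy consumption (kWh)'],
    'Indicator': ['total energy consumption (kWh)'],
    '2020-Jan': [188239.0],
    '2020-Feb': [175000.0],
})

# melt turns each month column into a row, giving a long-format frame
long = wide.melt(id_vars=['ESH Aspects or Topic', 'Objectives', 'Indicator'],
                 var_name='month')
long['month'] = pd.to_datetime(long['month'], format='%Y-%b')
print(long[['month', 'value']])
```

Each month column becomes one row, which is exactly the long format the plotting and forecasting steps below expect.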
In [3]:
df.head()
Out[3]:
ESH Aspects or Topic Objectives Indicator month value
0 Use of energy (direct / indirect) Reduction of total energy consumption (kWh) total energy consumption (kWh) 2020-01-01 188239.0
1 Use of energy (direct / indirect) Reduction of CO2 emissions (kg) total CO2 emission (kg) 2020-01-01 60451.0
2 Use of water Reduction of total water consumption total water consumption (m3) 2020-01-01 369.0
3 Contact electric demand Reduction of max usage max electric usage demand (kW) 2020-01-01 526.0
5 General waste Reduction of total waste generation general t waste total (kg) 2020-01-01 896.0

Overall trend¶

First, I plot all the indicators to check whether there are any interesting trends.

We can observe clear signs of yearly seasonality in a few of the indicators:

  • Total energy consumption
  • CO2 emissions
  • Electric usage demand
  • General t waste total
  • Industrial t waste total

For the other indicators, it is less clear whether there is any seasonality.

In [4]:
import numpy as np
np.corrcoef(df[df['Indicator'] == 'total energy consumption (kWh)']['value'],df[df['Indicator'] == 'total CO2 emission (kg)']['value'])
Out[4]:
array([[1.        , 0.64962757],
       [0.64962757, 1.        ]])
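Note that calling np.corrcoef on two separately filtered slices implicitly assumes the rows line up month for month. A safer pattern is to align on the month index first and then correlate; a minimal sketch with toy data (indicator names 'a' and 'b' are made up):

```python
import pandas as pd

# Toy long-format data: two indicators observed over the same three months
toy = pd.DataFrame({
    'Indicator': ['a', 'a', 'a', 'b', 'b', 'b'],
    'month': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01'] * 2),
    'value': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Pivot so each indicator becomes a column indexed by month, then correlate;
# this guarantees the values being compared belong to the same months
wide = toy.pivot(index='month', columns='Indicator', values='value')
corr = wide['a'].corr(wide['b'])
print(corr)  # ~1.0: b is exactly 2 * a
```

This is the same pivot-then-correlate approach used for the heatmap below, applied pairwise.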
In [5]:
import plotly.offline
plotly.offline.init_notebook_mode()
for indicator in df['Indicator'].unique():
    fig = px.line(data_frame = df[df['Indicator'] == indicator], y= 'value', x='month', title = indicator, markers = True)
    fig.update_layout(yaxis_title = indicator)
    # fig.update_traces(textposition="bottom right")
    fig.show()

Correlation check¶

Next I check for the correlations between some selected indicators.

It is interesting to see that total energy consumption has a relatively high positive correlation with CO2 emission (0.65) and electric usage demand (0.87). Looking at the line charts, we can see that these 3 indicators indeed share a very similar seasonality: the highest peak in August and a minor peak in December, each followed by a rapid decline.

Another thing to note is that the general waste and industrial waste indicators are also highly correlated (0.74).

In [6]:
indicators_to_visualize = ['total energy consumption (kWh)',
 'total CO2 emission (kg)',
 'total water consumption (m3)',
 'max electric usage demand (kW)',
 'general t waste total (kg)',
 'industrial t waste total (kg)',
 'Wood chips',
 'Cardboard']

pivot_df = df.pivot(columns = 'Indicator', values = 'value', index = 'month')[indicators_to_visualize] # turn each indicator into a column

plt.figure(figsize=(10, 8))
sns.heatmap(pivot_df.corr(), annot=True,         # Show values on the heatmap
    fmt=".2f",          # Format the annotations
    cmap="coolwarm",    # Colormap
    linewidths=0.5,     # Line width between cells
    cbar_kws={'shrink': 0.8},  # Adjust color bar size
    square=True         # Make cells square-shaped
    )
Out[6]:
<Axes: xlabel='Indicator', ylabel='Indicator'>

Time series forecasting¶

Some observations after taking a deeper look at total energy consumption:

  • Seasonality: Every year, the value declines from January to April/May, then increases quickly to a yearly peak in August, before falling until November and rising to a second, lesser peak in December.
  • Trend: Generally, there is a downward trend from 2020 to 2023, but consumption seems to rise again in 2024.
    • This is also evident in the YoY comparison in the original Excel sheets: the YoY figure is below 100% from 2020 to 2023 (indicating a decline from the same month of the previous year) but over 100% in 2024.
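The YoY comparison mentioned above can be reproduced from a monthly series by dividing each value by the value 12 months earlier. A minimal sketch with toy numbers, assuming a complete monthly index with no gaps:

```python
import pandas as pd

# Toy monthly series: 24 months, second year 10% lower than the first
idx = pd.date_range('2020-01-01', periods=24, freq='MS')
s = pd.Series([100.0] * 12 + [90.0] * 12, index=idx)

# YoY ratio: each month divided by the same month of the previous year
yoy = s / s.shift(12)
print(yoy['2021-01-01'])  # 0.9, i.e. 90% of the previous year -> a decline
```

Values below 1.0 (below 100% in the Excel sheets) indicate a decline from the same month of the previous year.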

With such a clear trend and seasonality, I decide to apply time series forecasting to see if we can predict future values.

The method I use for this part is Prophet, a package by Meta (formerly Facebook) for time series forecasting. It works relatively well on datasets with trend, seasonality, etc. and requires minimal configuration.

Looking at the plot of actual and predicted values, we can see that the model does a decent job predicting on seen data. It also captures relatively well the trend and seasonality observed above.

In [7]:
# putting the data into the format required by Prophet (columns 'ds' and 'y')
data = df[df['Indicator'] == 'total energy consumption (kWh)'][['month', 'value']]
data.columns = ['ds', 'y']  # 'month' was already parsed to datetime earlier

# fitting the model and making prediction
m = Prophet()
m.fit(data)

future = m.make_future_dataframe(periods=3, freq='MS')
forecast = m.predict(future)

# plotting the results
plot_plotly(m, forecast)

19:32:55 - cmdstanpy - INFO - Chain [1] start processing
19:32:55 - cmdstanpy - INFO - Chain [1] done processing
In [8]:
forecast = forecast.merge(right = data, on = 'ds')

fig1 = px.line(forecast,'ds', ['y','yhat'])
fig1.show()

To better evaluate the model, I calculate a cross-validation RMSE. The idea is inspired by the book Forecasting: Principles and Practice (chapter 5).

Basically, I train only on data up to (and excluding) a given month, then predict that particular month and measure the error (error = y_true - y_hat). Repeating this process for several months gives the cross-validation result. Here I use the 9 months of 2024 (January to September) for cross-validation.

Here we can see that the cross-validation RMSE is around 39000. Not bad!
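The same rolling-origin idea can be illustrated without Prophet, e.g. with a seasonal-naive forecast (reuse the value from 12 months earlier) on a toy series; everything here (the series, the cutoffs) is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy monthly series from Jan 2020 to Sep 2024 with a simple repeating pattern
idx = pd.date_range('2020-01-01', periods=57, freq='MS')
y = pd.Series([100 + (m.month % 3) * 10 for m in idx], index=idx)

squared_errors = []
for cutoff in pd.date_range('2024-01-01', periods=9, freq='MS'):
    train = y[y.index < cutoff]          # train on data strictly before the cutoff
    y_hat = train.iloc[-12]              # seasonal-naive: value from 12 months ago
    squared_errors.append((y[cutoff] - y_hat) ** 2)

cv_rmse = np.sqrt(np.mean(squared_errors))
print(cv_rmse)  # 0.0 for this perfectly periodic toy series
```

Swapping the seasonal-naive step for a model fit (as done with Prophet below) gives the same rolling-origin evaluation.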

In [9]:
import datetime
months = [datetime.date(2024, m, 1).strftime('%Y-%m-%d') for m in range(1, 10)]
print(months)
['2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01', '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01', '2024-09-01']
In [10]:
cv_squared_errors = []
for month in months:
    m = Prophet()
    m.fit(data[data['ds'] < month])


    future = m.make_future_dataframe(periods=1, freq='MS')
    forecast = m.predict(future)

    forecast = forecast.merge(right = data, on = 'ds', how = 'left')
    forecast['squared_error'] = (forecast['y'] - forecast['yhat'])**2
    
    cv_squared_errors.append(forecast[forecast['ds'] == month]['squared_error'].iloc[0])

    # fig1 = px.line(forecast,'ds', ['y','yhat'])
    # fig1.show()
19:32:55 - cmdstanpy - INFO - Chain [1] start processing
19:32:55 - cmdstanpy - INFO - Chain [1] done processing
In [11]:
cv_rmse = np.sqrt(np.mean(cv_squared_errors))
print(cv_rmse)
39048.135909465585
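To put the ~39,000 RMSE in perspective, it can be normalized by the scale of the series; against the January 2020 value of 188,239 kWh seen earlier, the error is roughly 21% of a typical month (a rough yardstick only, since consumption varies across months):

```python
# Rough scale check: CV RMSE relative to one observed monthly consumption value
cv_rmse = 39048.14        # cross-validation RMSE computed above
typical_level = 188239.0  # total energy consumption, Jan 2020 (seen earlier)
print(round(cv_rmse / typical_level, 2))  # ~0.21, i.e. about 21% of that month
```

A mean absolute percentage error over all months would give a more robust relative figure.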

Conclusion and Future work¶

In this notebook, I have performed a quick analysis of the given indicators concerning the ESH targets, including data transformation, visual analysis and correlation analysis. I also applied time series forecasting to one of the variables.

The analysis here could benefit significantly from a better understanding of the subject domain and the meaning of each indicator, as well as a clear objective for the analysis. Moreover, time series forecasting could also be attempted for the other variables. Other forecasting methods, such as ARIMA and deep neural networks, should also be considered and evaluated.